{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "punL79CN7Ox6"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "_ckMIh7O7s6D"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qzwilbae73N4"
      },
      "source": [
        "# Tokenize and sequence a bigger corpus of text\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S5Uhzt6vVIB2"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c03_nlp_prepare_larger_text_corpus.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c03_nlp_prepare_larger_text_corpus.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VcB-N6WrAT9q"
      },
      "source": [
        "So far, you have written some test sentences and generated a word index and then created sequences for the sentences. \n",
        "\n",
        "Now you will tokenize and sequence a larger body of text, specifically reviews from Amazon and Yelp. \n",
        "\n",
        "## About the dataset\n",
        "\n",
        "You will use a dataset containing Amazon and Yelp reviews of products and restaurants. This dataset was originally extracted from [Kaggle](https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set).\n",
        "\n",
        "The dataset includes reviews, and each review is labelled as 0 (bad) or 1 (good). However, in this exercise, you will only work with the reviews, not the labels, to practice tokenizing and sequencing the text. \n",
        "\n",
        "### Example good reviews:\n",
        "\n",
        "*   This is hands down the best phone I've ever had.\n",
        "*   Four stars for the food \u0026 the guy in the blue shirt for his great vibe \u0026 still letting us in to eat !\n",
        "\n",
        "### Example bad reviews:  \n",
        "\n",
        "*   A lady at the table next to us found a live green caterpillar In her salad\n",
        "*   If you plan to use this in a car forget about it.\n",
        "\n",
        "### See more reviews\n",
        "Feel free to [download the dataset](https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P) from a drive folder belonging to Udacity and open it on your local machine to see more reviews."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wr21SvWhQhvN"
      },
      "outputs": [],
      "source": [
        "# Import Tokenizer and pad_sequences\n",
        "import tensorflow as tf\n",
        "from tensorflow.keras.preprocessing.text import Tokenizer\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
        "\n",
        "# Import numpy and pandas\n",
        "import numpy as np\n",
        "import pandas as pd\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cJOCSbdERsdc"
      },
      "source": [
        "# Get the corpus of text\n",
        "\n",
        "The combined dataset of reviews has been saved in a Google drive belonging to Udacity. You can download it from there."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kBpFip-X69Hf"
      },
      "outputs": [],
      "source": [
        "path = tf.keras.utils.get_file('reviews.csv', \n",
        "                               'https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P')\n",
        "print (path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZCT57MVGTENX"
      },
      "source": [
        "# Get the dataset\n",
        "\n",
        "Each row in the csv file is a separate review.\n",
        "\n",
        "The csv file has 2 columns:\n",
        "\n",
        "*   **text** (the review)\n",
        "*   **sentiment** (0 or 1 indicating a bad or good review)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TlyreClyS7H3"
      },
      "outputs": [],
      "source": [
        "# Read the csv file\n",
        "dataset = pd.read_csv(path)\n",
        "\n",
        "# Review the first few entries in the dataset\n",
        "dataset.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fk5uzq4Oco7h"
      },
      "source": [
        "# Get the reviews from the csv file"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "u7uCBlAqdEzK"
      },
      "outputs": [],
      "source": [
        "# Get the reviews from the text column\n",
        "reviews = dataset['text'].tolist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OS0mg5yoVzQL"
      },
      "source": [
        "# Tokenize the text\n",
        "Create the tokenizer, specify the OOV token, tokenize the text, then inspect the word index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "atgLJzAiVwqB"
      },
      "outputs": [],
      "source": [
        "tokenizer = Tokenizer(oov_token=\"<OOV>\")\n",
        "tokenizer.fit_on_texts(reviews)\n",
        "\n",
        "word_index = tokenizer.word_index\n",
        "print(len(word_index))\n",
        "print(word_index)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vfh0WGmKWyjI"
      },
      "source": [
        "# Generate sequences for the reviews\n",
        "Generate a sequence for each review. Set the max length to match the longest review. Add the padding zeros at the end of the review for reviews that are not as long as the longest one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VwyqBS2nV53o"
      },
      "outputs": [],
      "source": [
        "sequences = tokenizer.texts_to_sequences(reviews)\n",
        "padded_sequences = pad_sequences(sequences, padding='post')\n",
        "\n",
        "# What is the shape of the vector containing the padded sequences?\n",
        "# The shape shows the number of sequences and the length of each one.\n",
        "print(padded_sequences.shape)\n",
        "\n",
        "# What is the first review?\n",
        "print (reviews[0])\n",
        "\n",
        "# Show the sequence for the first review\n",
        "print(padded_sequences[0])\n",
        "\n",
        "# Try printing the review and padded sequence for other elements."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "l09c03_nlp_prepare_larger_text_corpus.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
